2024 Ironman Contest Day 14: Significant-Text Aggregation again (?)

2024 iThome Ironman Contest

Self-Challenge Group

Restarting Elasticsearch series, part 13

16th Ironman Contest

kimcheng

2024-09-29 23:39:52

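The previous post used significant text aggregation to complete a partially typed word. Once the word is completed, the next step is to find what continues it: if the word is covid, what should come after covid? The idea is simple: find the terms most strongly associated with covid. So how do we do that in ES?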

As the title hints, it is significant text again! This time, though, we change include to exclude:

GET /covid19_tweets/_search
{
  "query": {
    "match": {
      "tweet": "covid"
    }
  },
  "aggs": {
    "tags": {
      "significant_text": {
        "field": "tweet",
        "exclude": "covid",
        "size": 3,
        "min_doc_count": 100
      }
    }
  }
}

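exclude is the setting that keeps one or more specified tokens out of the returned tokens. Here we use it to drop the search term itself (covid), because the suggested continuations should not contain the term the user has already typed.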
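The same request can also be sent from Python. The sketch below assumes the elasticsearch-py 8.x client, the style the loading script in this series uses. `build_continuation_request` and `suggest_continuations` are hypothetical helper names introduced for illustration, and `es_cli` stands for any client object that exposes `search`.

```python
def build_continuation_request(term, size=3, min_doc_count=100):
    """Build the query and aggregation that look for tokens co-occurring with `term`."""
    return {
        "query": {"match": {"tweet": term}},
        "aggs": {
            "tags": {
                "significant_text": {
                    "field": "tweet",
                    # keep the search term itself out of the suggestions
                    "exclude": term,
                    "size": size,
                    "min_doc_count": min_doc_count,
                }
            }
        },
    }


def suggest_continuations(es_cli, index_name, term):
    """Return the bucket keys in the order ES returns them (hypothetical helper)."""
    body = build_continuation_request(term)
    resp = es_cli.search(
        index=index_name,
        query=body["query"],
        aggs=body["aggs"],
        size=0,  # assumption: only the aggregation is needed, so skip the hits
    )
    return [bucket["key"] for bucket in resp["aggregations"]["tags"]["buckets"]]
```

Passing `size=0` is a choice made in this sketch to leave out the hits list. The console request above keeps the default and returns hits as well.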

This time the results are rather disappointing:

{
  "took": 801,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 2.9927912,
    "hits": [
      ... (omitted) ...
    ]
  },
  "aggregations": {
    "tags": {
      "doc_count": 32250,
      "bg_count": 354600,
      "buckets": [
        {
          "key": "19",
          "doc_count": 26206,
          "score": 3.579179955827552,
          "bg_count": 53314
        },
        {
          "key": "https",
          "doc_count": 19696,
          "score": 0.10720518675055014,
          "bg_count": 184226
        },
        {
          "key": "t.co",
          "doc_count": 19696,
          "score": 0.10703375846470577,
          "bg_count": 184270
        }
      ]
    }
  }
}

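We got 19, https, and t.co. Only 19 is a reasonable continuation; https and t.co are not much help to a user. The cause is that tweets contain many links in addition to plain text, and since our index settings did not specify an analyzer, ES used the default standard analyzer. That analyzer splits the text of a URL into tokens, so every tweet that carries a link contains the token https.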
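To see why 19 ranks so far above the two URL tokens, the scores in the response can be reproduced by hand. The sketch below assumes the aggregation used its default JLH heuristic (the request does not configure another one), which multiplies the absolute change in frequency by the relative change. `jlh_score` is a hypothetical helper name; the numbers are copied from the response above.

```python
def jlh_score(doc_count, subset_size, bg_count, superset_size):
    """JLH heuristic: (fg% - bg%) * (fg% / bg%), or 0 when the term is not more frequent in the subset."""
    fg = doc_count / subset_size      # share of matching tweets containing the token
    bg = bg_count / superset_size     # share of all tweets containing the token
    if fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


SUBSET_SIZE = 32250     # aggregations.tags.doc_count
SUPERSET_SIZE = 354600  # aggregations.tags.bg_count

# token -> (doc_count, bg_count), taken from the buckets in the response
buckets = {
    "19": (26206, 53314),
    "https": (19696, 184226),
    "t.co": (19696, 184270),
}

scores = {
    token: jlh_score(doc_count, SUBSET_SIZE, bg_count, SUPERSET_SIZE)
    for token, (doc_count, bg_count) in buckets.items()
}

for token, score in scores.items():
    print(f"{token}: {score:.6f}")
```

The token 19 appears in about 81% of the covid tweets but only about 15% of all tweets, which gives it a large score. https appears in about 61% of the covid tweets and about 52% of all tweets, so it is only slightly over-represented, yet that is still enough to place second once covid itself is excluded.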

Let's try defining our own analyzer in the index settings and see whether that improves the results:

{
  "analysis": {
    "analyzer": {
      "index_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ],
        "char_filter": [
          "pun_filter"
        ]
      }
    },
    "char_filter": {
      "pun_filter": {
        "type": "mapping",
        "mappings": [
          "/=> ",
          ".=>",
          "ー=>"
        ]
      }
    }
  }
}
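Once an index has been created with these settings, the `_analyze` API is one way to check which tokens the analyzer actually produces for a sample tweet, rather than inferring it from aggregation results. The sketch below assumes the elasticsearch-py 8.x client; `show_tokens` is a hypothetical helper name and the sample text is made up.

```python
def show_tokens(es_cli, index_name, text, analyzer="index_analyzer"):
    """Run `text` through the named analyzer of an existing index and list the tokens."""
    resp = es_cli.indices.analyze(index=index_name, analyzer=analyzer, text=text)
    return [token["token"] for token in resp["tokens"]]


# Usage against a live cluster (not run here):
# show_tokens(es_cli, "covid19_tweets", "COVID-19 update https://t.co/xxxx")
```

Comparing the output for `analyzer="standard"` and `analyzer="index_analyzer"` on the same text shows directly how the char_filter changes the handling of links.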

Update the mappings so that the tweet field uses the index_analyzer defined in the settings:

{
  "properties": {

    ...(omitted)...

    "tweet": {"type": "text", "analyzer": "index_analyzer"},

    ...(omitted)...
  }
}

Delete the original index, modify the Python script that loads the data as shown below, and run it again:

...(omitted)...

index_mapping_file = 'index_mapping.json'
index_setting_file = 'index_settings.json'  # add this line

...(omitted)...

# read index mapping file
with open(index_mapping_file, 'r') as f:
    index_mapping = json.load(f)
# read index setting file (add this block)
with open(index_setting_file, 'r') as f:
    index_setting = json.load(f)

...(omitted)...

# create index
r = es_cli.indices.create(index=index_name, mappings=index_mapping, settings=index_setting)  # modify this line

...(omitted)...
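The index has to be deleted and reloaded because an analyzer only applies to documents at the time they are indexed, so tweets already stored would keep their old tokens. The delete step could also be folded into the script. The sketch below assumes the elasticsearch-py 8.x client; `recreate_index` is a hypothetical helper name, not part of the original script.

```python
def recreate_index(es_cli, index_name, index_mapping, index_setting):
    """Drop the index if it exists, then create it with the new mappings and settings."""
    if es_cli.indices.exists(index=index_name):
        # the old documents were tokenized by the old analyzer, so remove them
        es_cli.indices.delete(index=index_name)
    return es_cli.indices.create(
        index=index_name,
        mappings=index_mapping,
        settings=index_setting,
    )
```

After this call the index is empty, so the rest of the loading script still has to write the tweets again.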

Running the query again against the rebuilt index gives this result:

{
  "took": 510,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 3.009927,
    "hits": [
      ...(omitted)...
    ]
  },
  "aggregations": {
    "tags": {
      "doc_count": 15739,
      "bg_count": 88210,
      "buckets": [
        {
          "key": "19",
          "doc_count": 12673,
          "score": 3.624382620593038,
          "bg_count": 12911
        },
        {
          "key": "the",
          "doc_count": 8164,
          "score": 0.10506112232239737,
          "bg_count": 38049
        },
        {
          "key": "pandemic",
          "doc_count": 1026,
          "score": 0.09701440718573524,
          "bg_count": 2311
        }
      ]
    }
  }
}

This looks more reasonable, but there is still room for improvement: